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Abstract 

The Adaptive Multiple Importance Sampling (AMIS) algorithm is aimed at an optimal recycling 
of past simulations in an iterated importance sampling scheme. The difference with earlier adaptive 
importance sampling implementations like Population Monte Carlo is that the importance weights of 
all simulated values, past as well as present, are recomputed at each iteration, following the technique 
of the deterministic multiple mixture estimator of Owen and Zhou (2000). Although the convergence 
properties of the algorithm cannot be fully investigated, we demonstrate through a challenging banana 
shape target distribution and a population genetics example that the improvement brought by this 
technique is substantial. 

Keywords: adaptive importance sampling, sequential Monte Carlo, population Monte Carlo, particle 
filters, deterministic mixture weights, banana shape target, population genetics. 

1 Introduction 



Importance sampling (Rubinstein and Kroese 2007) is a well-established simulation method used to overcome 



the difficulties connected with the intractability of a target distribution II. Its shortcomings are also well- 
documented, in particular the degradation of its performances against the dimensionality of the problem. 
When computing importance weights, uji = ir(xi)/q(xi) associated with samples Xi ~ Q from an importance 
distribution Q, the distribution of those weights customarily deteriorates as the dimension of x increases, 
unless the importance distribution Q is tuned more finely against the target II. [7r(x) and q{x) are the 
densities of, respectively, the target and the importance distributions with respect to the same dominating 
measure v (typically the Lebesgue measure) and, Q is chosen such that II is absolutely continuous with respect 
to Q.] Since, in practical settings, the fine tuning of the importance distribution against the target is difficult, 
alternative Markov chain Monte Carlo approaches have often been advocated as being more appropriate for 



large dimensional problems (see Robert and Casella (2004 1 ) but recent attempts have been made to construct 



importance functions that automaticaly adapt to the target distribution based on earlier importance samples 
(see, e.g. Ortiz and Kaelbling] 2000 Liu et al. 2001 Pennanen and Koivu 2004 Rubinstein and Kroese 



2004). Those methods are called adaptive importance sampling but they also relate with particle filters 



2002 Del Moral et al. 2006) 



(Gordon et al. 1993 Doucet et al. 2001 ) and sequential Monte Carlo methods (Doucet et al. 2000 Chopin 
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There are many different strategics of adaptive importance sampling. For instance, the generic Population 
Monte Carlo (PMC) scheme of Cappe et al. (2004) can be implemented as the D-kernel (Douc et al. 2007a|b I 
algorithm, which tries to fit a mixture of D given kernels or importance sampling distributions in terms of 
either minimum variance or minimum Kullback-Leibler divergence. While this algorithm is shown to converge 
to the best possible solution (within the class of D kernels), it is concerned with a specific type of proposals 
and it still uses a small family of distributions that may fail to properly represent the target. 

In this paper, we propose a novel perspective to pool together importance samples. Those importance 
samples originate from different importance sampling distributions and are associated with corresponding 
importance weights, of the form 

u\ = 7r(xl)/q t (xl) , (1) 

where q t and 7r are proper densities and where x\ ~ Qt (0 < t < T , 1 < i < N t ). While those T samples 
can be put together by keeping these original importance weights (Robert and Casella, 2004, Chapter 14), 



there exists a stabilising alternative called deterministic multiple mixture due to Veach and Guibas (19951 



and popularised by Owen and Zhou ( 2000 1 



This alternative is akin to the defensive sampling approach of Hesterberg ( 1998 1 in that it consists in 
modifying the denominators of all importance weights lj\ from the density qt(x*) of the distribution under 
which each point x\ was truly simulated to a mixture of all the densities that have been used for the T 
different samples, namely 

1 T 

resulting into the deterministic mixture weight 



^2N t q t (xt). 



(2) 



This idea has originally been proposed by Veach and Guibas ( 1995 1 and 



£-ij=0 iv J t=0 i=l 



T N t 
t=0 i = l 



h(x)ir(x)v(dx) = E u (h(x)) . (3) 



The denomination of the deterministic mixture weights stems from the fact that the weights of the mixture are 
not estimated or varying over time. This is a major difference with the PMC schemes of Douc et al. ( 2007a|b I 
where the weights of the proposals are optimised against an efficiency criterion like the Kullback-Leibler 
divergence. 

The novelty in our approach called AMIS (for Adaptive Multiple Importance Sampling) is that, when 
compared with the previous works on multiple mixtures, our family (Qt) of importance sampling distributions 
is constructed sequentially and adaptively, in the sense that the importance sampling distribution used at 
each iteration t (1 < t < T) is based on the past t — 1 weighted samples. Therefore, at each step t of the 
algorithm, 

i. the weights of all (present and past) simulated variables x\ (1 < I < t , 1 < i < N t ) are modified, based 
on the current collection of proposals (importance sampling distributions) (Qi)o<i<t, and 

ii. the entire collection of importance samples is used to build the next importance function. 

Note that, while (JTTJ) is a classical feature of Population Monte Carlo, most implementations that construct 



Qt based on past iterations (|Douc et al. 
previous generation, t — 1. 



2007b Cappe et al. 2008 1 only use the sample produced at the 
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We detail the reasons for promoting multiple mixture importance sampling in and analyse some 
associated algorithms in f]3] while discussing their theoretical properties in <|4j The performances of the 
AMIS algorithm are tested in f|5] over a challenging banana shape target distribution and in Sj6] over a 
population genetic application. This latter type of model has motivated the development of the proposed 
methodology. Indeed, the likelihood of a genetic model most often is not tractable and regardless of the 
approximation method used, its derivation involves a non-negligible cost. 



2 Multiple mixtures 

The modification in the importance weights from the original ratios Q to the mixture ratios ^ may sound 
surprising or even paradoxical in that the simulated values (and therefore the distributions used to simulate 
those) have not changed. We thus detail in this section the motivations for using multiple mixtures. 

There exists a fundamental methodological difficulty in using several importance functions at once. In- 



o 



'No 



deed, if II is the target density and Qo, ■ ■ ■ , Qt are T different importance functions, samples x^ 
. . ., xj , . . . , xJj t that are simulated from these importance functions, with associated standard importance 
weights u>j = ir(xl) /qt(xl), can be merged together in that the empirical distribution function 



J2^s xi (x)/J2 



w2 



produces (in the marginal sense) an output approximatively distributed from the target tt. Unfortunately, 
this property is not sufficient to ensure that the resulting sample performs satisfactorily. For instance, if one 
of the importance functions q t is associated with an infinite variance in the weights w|, i.e. if E[(«;') 2 ] = +00 
for one < t < T , the potentialy very large weights resulting from this importance experiment will remain 
very large in the cumulated sample, no matter how efficient the other importance functions are. Therefore, 
the poorly performing sample will overwhelmingly dominate the other samples in the final approximation 
and thus ruins the overall performances of the method. The conclusion of this point is that the raw mixing of 
importance samples and of their importance weights, when using different proposals, can be quite harmful, 
when compared with using a single sample, even when most proposals are efficient. 

As discussed at large in Owen and Zhou (2000[), using deterministic mixture as a representation of the 



production of the simulated sample has the potential to exploit the most efficient proposals in the sequence 
Qo, . . . , Qt without rejecting any simulated value nor sample. The poorly performing importance functions 
are simply eliminated through the erosion of their weights 

/ 1 T 
AO/ T ^Niqitf) 

I 1^ 3 =Q N 3 1=0 

as T increases. Indeed, if qo is the poorly performing proposal, while the qi's (I > 1) are good approximations 
of tt, for a value x® such that n(x®) / q (x°) is large, because qo(%i) is small, ir(x°) / {N a qo{x° i ) + . . .+N T q T (x®)} 
will behave like n(x®)/{Niqi(x°) + ... + N T q T (x°)} and decrease to zero as T increases. 

The paradoxical feature of having several acceptable importance weights for the same simulated value 



is well-understood in the case of Rao-Blackwellisation (Robert and Casclla 2004) and of Population Monte 



Carlo (Iacobucci et al. 20091. In the setting of multiple mixtures, the argument is however more intricate 
than the marginalisation inherent to the Rao-Blackwellisation schemes in that only the unbiasedness aspect 
(in the restricted sense of (|3|) remains. The modification of the weights is indeed difficult to perceive as 
an extra Rao-Blackwell step that would eliminate the randomness related to the mixture structure, i.e. the 
selection of the component, since the cumulation of the samples from the components of the mixture is not 
a sample from the mixture. 
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3 The AMIS algorithm 



3.1 The generic AMIS algorithm 



As explained in the introduction, the idea at the core of the AMIS algorithm is that, for each time-step t 
that produces samples of sizes Nq,Ni,... ,Nt, we can update not only the weights uj\ of the N t current 
particles, x*, but also the weights io\ of all past particles x\, < I < t — 1. Our algorithm can thus be 
interpreted as a Rao-Blackwell type of importance sampling where the whole sample of X) J=o points can 
be envisioned of as being homogeneously sampled from a deterministic mixture made of the overall sum of 
proposals. (Once again, the term deterministic mixture is a misnomer in that the overall sample is not the 
outcome of a mixture simulation.) 

The major difference with various PMC versions (Cappe et al. 2004 Douc et al. 2007a|b Cappe et al. 



2008 ) is that every single simulated value is recycled and rcwcightcd at every step of our iterative algorithm 



by virtue of selecting the appropriate deterministic mixture. Indeed, at each iteration t of the algorithm, a 
new adaptive importance sampling distribution is constructed by using, not only the particles corresponding 
to the current iteration, but all the weighted particles, based on a well-chosen efficiency criterion as in earlier 
PMC versions (Cappe et al., 2008). In the most standard case when the proposal Q t is parameterised, 
i.e. when Q t is of the form Q(8 t ) within a parametric family of distributions {Q(8), 9 S ©}, the adaptivity 
consists in estimating 6 t by 6 t at each iteration, using all the weighted samples accumulated so far. This 
estimation is obtained by using a well-defined criterion. In the two next sub-sections, we present two 
different criteria. Intuitively, the precision of those estimators can thus only improve, because we use more 
and more points and because we stabilise the denominators in the importance weights. As detailed in Sj4] 
from a theoretical perspective, we can only establish partial convergence results for the AMIS estimator that 
corresponds to the first criterion. 

A pseudo-code representation of the generic AMIS algorithm is given as follows: 

Algorithm 1. Generic AMIS 
At iteration t = 0, 

1) Independently generate No particles x° (1 < i < iVo) from a carefully constructed proposal Qo- 

2) For 1 < i < No, compute 



d? = Noqo(x°i) and w? = tt(^) j (6°/N ) 



3) Compute the importance sampling parameter estimate 6 of the parametric family {Q(9),6 £ ©} using the 
weighted particles 

and a well-chosen estimation criterion. 



At iteration t = l,...,T 

1) Independently generate N t particles x\ (1 < i < N t ) as x\ ~ Q (j) j 



2) For 1 < i < Nt, compute the multiple mixture at x\ 

t 

Sj = Noqo(xl) +^ y Niq (x\; Q l 

! = 1 

and derive the importance weight of particle x\, 



J=0 
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3) For < I < t — 1 and 1 < i < Ni, actualise the past importance weights as 

S\ = $1 + q (x\; 0* ^ and ui\ = tt (x l A 

4) Compute the parameter estimate using all the weighted particles 

({x1,UJi} {x° Ng ,uj% } {x\,u{} {a;V t ,^Ar t }) 
and the same estimation criterion. 



After T iterations of the AMIS algorithm, for any II-integrable application h, the self-normalized AMIS 
estimator of En(h(x)) = J h{x)it(x)v{dx) is: 

1 T N t 

E n (h(x))= , EE^)' (4) 

Z^t=0 Z-^i=l W i t=0 i=l 

In most practical settings, a self-normalized estimator is used because the density of the target distribution 
is known only up to a normalizing constant. In that case, the importance weights can only be evaluated 
up to this normalizing constant and, by construction, the self-normalized estimator does not depend on this 
constant. Moreover, in a lot of cases, the self-normalized estimator has a smaller Mean Square Error than 
the not self-normalized one. In the sequel, we always used the self-normalized AMIS estimator, even for the 
benchmark banana shape target considered in S|5] 



. ' 7=0 



3.2 Student's t proposal 



Since the above algorithm is set in generic terms, we now describe a special case that applies to many 
settings and can be seen as a vanilla AMIS algorithm. As in the most recent PMC algorithm of Cappe 
et al. (2008), the proposal distribution Q t is a Student's t proposal, E) whose mean fi and covariance E 



parameters are updated by estimating both first moments of the target distribution II using self- normalized 
AMIS estimators: 



uj\x\ 



and E* 



1^1=0 l^i=l U i 



/ yr 



(Note that the degrees of freedom of the t distribution are always set to 3 as the lowest value allowing for 
finite first moments but they could also be estimated at each iteration.) 

Quite obviously and as illustrated by the next section, more elaborate proposals are possible, depending 
on the information available on II. For instance, if the potential for multimodality of the target II is high 
enough, a mixture of Student's t distributions as inlCappe et al. (20081 would be more appropriate. When 



dealing with a Baycsian hierarchical model, creating classes (or blocks) of components of the parameter in 
agreement with the hierarchical levels (as in Gibbs sampling) and designing the Student's t proposals block 
by block should also be more efficient. In the case of bounded simulation spaces, a Beta distribution proposal 
or a mixture of Beta distributions are obviously more relevant than heavy tailed Student's distributions. 

Similarly, matching the expectation and the covariance structure of the Student's proposal distribution 
with both first moments of the target distribution is only one among many efficiency criteria that can be 
used to calibrate the parameters of the proposal distribution at each step of the algorithm. For instance, as 
done in the next section, we can alternatively minimise the Kullback-Lcibler divergence between the target 



and the proposal distribution following the approach of Cappe et al. ( 2008 1 . 
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Although wc do not elaborate on this possible improvement, note also that, once this weighted sample 
based on Ylt=o Nt simulations is obtained following the steps detailed in the pseudo-code below, it is possible 
to apply a final clustering (standard) algorithm on this sample, based on a Gaussian mixture representation. 
Those clusters can be used to estimate local covariance and mean structures and then simulate a final and 
global sample based on the cluster representation but using Student's t distributions. Because all weights 
are controlled, we can then merge this final sample with the sequence of earlier samples without losing the 
deterministic representation. Once again, this is only one possible extension of the algorithm and it will 
only bring an improvement in settings when the Gaussian mixture clustering makes sense, i.e. when the 
components of the vector to be simulated are sufficiently homogeneous. In §3.3| we present a special AMIS 
algorithm based on mixture of Gaussian distributions. 

We provide below a pseudo-code representation of the Student's t version of the AMIS algorithm. In this 
code, t3 (x; /i, S) denotes the density of the ^(/x, E) distribution. 



Algorithm 2. Student's t AMIS 
At iteration t = 0, 

1) Independently generate No particles x° (1 < i < iVo) from a carefully constructed proposal Qq. 

2) For 1 < i < No, compute 

5° = N oqo (x° t ) and uj° = tt(x°) / (5°/N ) . 

3) Derive the weighted empirical first and second moments (i? and E° of this sample. 

At iteration t = l,...,T 

1) Generate independently Nt particles x\ (1 < i < N) from the d-variate Student's t, x\ ~ T3 (fi t ~ 1 , E*~ 

2) For 1 < i < N, compute the multiple mixture at x\ 



8j = N q (xl) + Nit 3 (x\; p} \E ! 1 
1=1 



and derive the importance weight of particle x\, 



Ui=n [x, \ 



J=0 



3) For 1 < I < t — 1 and 1 < i < N, actualize the past importance weights as 

S\ = 5\ +*a (x\; /t t_1 ,E t_1 ^ and u)\ = it (x • j / 

4) Compute p,* = (fb\, . . . , /t^) as 



3=0 



t JV, 



t N t 



and 



t N t 



t N, 



E* = ^T^Tulixi - [i){x\ - [a) 1 EE w *' 



G 



3.3 Mixtures of Gaussian distributions 



In this second special version of the AMIS algorithm, the importance functions are mixtures of multivariate 
Gaussian densities, 



q(x;9) 



k 

E 

i=\ 



Pi<p(x; 



k 



where (/?(•; /U, £) denotes a multivariate Gaussian density with mean /i and covariance matrix S, as in the 
D-kernel approach to PMC algorithms of Cappe et al. (20081. We also use the Kullback-Leibler divergence 



between II and Q to update the parameter 9 = (pi 



div(n,Q(0)) 



q{x;9) 



In this case, the best choice for the parameter 9 is the maximum likclihoood estimate of 9 and, in the AMIS 
setting, this means using at iteration t the whole sequence of samples x\ (0 < I < t) wi th their updated 



weights u>\ inside a weighted EM algorithm, which is solved using the mixmod software (Bicrnacki et al. 



2006). The number k of components used for the mixture can be either set in advance or, more realistically, 
estimated at iteration t = by the ICL criterion of Biernacki et al. ( |2000[ ) and a substantial number Nq 
of iterations. We do not reproduce the earlier pseudo-codes for this special case since the differences are 
obviously minimal. Note that the extension to a mixture of t densities is equally feasible since there exists 



a corresponding EM algorithm ( |Peel and McLachlan 2000 1 



3.4 Initialisation 

A major difficulty with all adaptive importance algorithms is that the starting distribution is paramount to 
achieve proper performances. According to the "what-you-get-is-what-you-see" features of such algorithms, 
it is quite difficult to recover from a poor starting sample, the adaptivity only focussing on the visited 



parts of the simulation space (see, e.g. Iacobucci et al. 2009). Therefore, we insist on the requirement that, 



in the implementation of our algorithm, a significant part of the computing effort must be spent on the 
initialization stage. A generic initialisation solution is to first implement an hypercube reparameterisation 
of the simulation space (via, for instance, an inverse logistic tranformation that maps all components of the 
vector to be simulated into the unit interval), to sample from a uniform distribution over this hypercube 
and then rescale this sample via a logistic transform scaled by imposing an acceptable Effective Sample 
Size (ESS) for the importance weights of the transformed sample. The ESS, deduced using self-normalized 
importance sampling estimators, indicates to which number of iid simulations an importance sample can be 
assimilated. If we dispose of an importance sample of size N having used the importance distribution Qq, 



the ESS is given by (Hesterberg 1998 Liu 2001) 



N 



l + V Qo W(x)/qo(x)] ■ 

This measure of efficiency does not depend on h and, in practice, 

^Qo{ n ( x )/lo(x)} = J {n(x)/qo(x) - E Qo [tt(x) / q (x)}} 2 q (x)v(dx) 

can be estimated using the coefficient of variation of the importance weights. 

Note that to maximize the ESS is equivalent to minimize the variance of the importance weights. More- 
over, in order to maximize the ESS, we only need simulate one such sample since we can adapt the scale 
for this sample. Obviously, this solution is far from being fool-proof and we favour an informed alternative 
implementation as soon as pieces of information on the target distribution are a priori available. 
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4 Convergence issues 



While establishing unbiasedness and convergence of the deterministic mixture estimator of Owen and Zhou 



( 2000 ) is straightforward, the introduction of an adaptive mechanism in the construction of the sequence 
of proposals, highly complicates both issues. The estimator is no longer unbiased and its convergence (in 
T for a fixed values of N t ) cannot be established without imposing restrictions on the simulation space. In 
order to explain those difficulties, we concentrate on the Student's t AMIS algorithm. We first define in §4.1| 
a stylised version of this scheme, before producing an unbiased version in j |4.2| along with a convergence 
result under compacity conditions. We stress that the goal of this section is to establish that the algorithm 
is viable in the "small Nt, large T" domain, in opposition to earlier PMC algorithms that were validated in 
the "large N t at each iteration" domain by standard importance sampling arguments. (And so is the AMIS 
estimator, in the sense that, at each iteration t, the estimator is converging when Nt is going to infinity.) 

4.1 A simplified Student's t algorithm 

We start this section by defining a simplified version of the Student's t AMIS algorithm that sets a simple 
theoretical background for the study of our algorithm. First, we only consider the setting of Algorithm [2] 



in Section 3.2 We restrict ourselves to the extreme case N t — 1, meaning that each iteration sees only one 
simulation, i.e. the proposal is updated after each new iteration. We also simplify the update of the Student's 
t proposal by only considering learning about the mean /i*, the covariance matrix being set to an arbitrary 
value. In conjunction with this choice, we use the simplified notation t 3 {x; /i) to denote the corresponding 
Student's t density with mean /j, and similarly T 3 ([i) to denote the distribution. 



Algorithm 3. Skeleton version of the Student's t AMIS 
Produce a sequence (xq, . . . , xt) of simulations with: 

xi ~ T 3 (ui(x )) ui(x ) = n(x )x /qo(x ) = fi a , 

rr ft \\ f \ 7r(a:o)xo n{x\)X\ x 

X 2 ~ T 3 (U 2 {X 0:1 )) U 2 {X 0:1 ) = r— — — — r + = H , 

q (x ) + t 3 (x ; ui(xo)) qo(xi) + t 3 (xi, ui(x )) 



t-i 



x, ~- T 3 (M t (x 0:t _i)) M t (xo: t _i) = V — - — _ t _i , rr = /' 

fc=0 ^\ X k) + Li=l hiXk] Ui{x :i-l)) 



n(x k )x k 



The current estimator of n, the target mean, after iteration t is therefore 



= U t+ l(X ; t ) = ^ 



~f qo(x k ) + J2i=i h(xk]iH(xo-.i-i)) 



4.2 Unbiasedness and convergence of the simplified algorithm 

First, establishing the unbiasedness of the estimator //' for every t > 1 does not follow from the same 



argument as in the original version of Owen and Zhou (2000) because of the dependence of the importance 
weight of x t on subsequent x^-'s (j > t). 
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For t > 1, we have 



E[£*] = 



E 

t 

E 

fc=0 



E 



Tr(x k )x k 



qo(xk) + Y,i=i h{xk] Ui(x -.i-i)) 
ir(x k )xk 



t 3 (x k ;u k (x ;k-i))dx k 



<7o0fc) + Yh=i h(xk;Ui(xo:i-i)) 

X gk{x :k~l) dXQ:k-lhk(x k +l:t\Xk) dXk+l:t 

where g k (xo :k -i) is the joint distribution of the past simulations and h k (x k +i-.t\x k ) is the conditional dis- 
tribution of the future simulations given the current one. Due to this latter term, the full conditional 
distribution of Xk given the past and future simulations xo-.k-i and Xk+i-.t is no longer t 3 (x k ; u k (xQ :k _i)) and 
this modification implies that fi 1 is biased. 

A simple modification of the above stylised algorithm brings a solution to the bias difficulty, by using 
two different simulation threads: 



Algorithm 4. Unbiased stylised AMIS 
Produce two sequences xq, . . . ,xt and xq, . . . ,Xt as 



%0i Xq 

x\,ii 

X2,X 2 



x t ,x t 



9oO) , 

%(ui(x )) Wi(^o) = ir(x )x /q (x ) = jl° , 

q {xo) + t 3 (x ;ui(x )) qo(xi) + t 3 (xi; Ui{x )) 



7r(xi)ii 



%(u t (x Q :t-l)) U t (i :t-l) 



t-1 

E 



7r(£fe)i fe 



fc=0 t?o(5fe) + Z) i= i *3(5fc; Ui(X0:i-l)) 



A*" 1 . 



The new estimator 



En(x k )x k 
j 

fc=0 9oOfc) + Y,i=i t3(xk;ui(x :i-i)) 



of [i = En(x) is then unbiased because it is so [unbiased] conditional on the second sequence {^oii-ili being 
a standard deterministic mixture estimator (lOwen and Zhou 2000 ) . Note that we are not advising the use 



of this double sequence for a practical implementation: this is simply a mathematically convenient device to 
achieve some advances on the convergence of the estimator. 

Using the decomposition V(/2') = E[V(//*|io : t-i)] + V(E[/2*|io : t-i]), it is furthermore straightforward to 
check that 



V(m*)=E[V(/iW-i)] = 5> 



k=0 

<±E 

fc=0 



E 



n{x k )x k 



qo(xk) + h(xk;ui(x :i-i)) 
ir 2 (x k )x 2 k t 3 (x k ; u k (x ;k-l)) 



{qo(x k ) + Y,i=i h{x k ;ui{xo;i-i))} 2 
ir 2 (x)x 2 



XQ-.t-l 



dx k 



qo(x) + J2i=i h{x; Ui{xQ-i-x)) 



Ax 
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Therefore, the variance of the simplified Student's t AMIS estimator ft is strictly decreasing at each iteration. 
That it converges to with T can only be established when -k{x) /t^{x; u) is bounded uniformly in u, as 
for instance when the support of 7r is bounded. A more general convergence result implies to study the 
distribution of the ztj(xo : j_i)'s and this proves to be incredibly challenging. 

5 A banana shape target example 

In order to compare the performance of the AMIS algorithm with the existing, we ressorted to the benchmark 



target density set by Haario et al. (1999 20011, which can be calibrated to be extremely challenging. The 
target density is based on a centered ^-multivariate Gaussian, x ~ A/l(0j,,E) with covariance matrix £ = 
diag(er 2 , 1, . . . , 1) which is twisted by a change of variable in the second coordinate from x-i to x-i — b{x\ — a 2 ). 
(Other coordinates remain unchanged.) This change of variable leads to a twisted (or banana shaped) density 
that is centered with uncorrelated components. Since the Jacobian of the twisting transformation is equal 
to 1, the target density is: 

t(x) = /jV(0„E) (xi,x 2 + b{x\ - a 2 ),x 3 , ...,x p ) , 

where /^(o ,£)(") denotes the density of the centered p-multivariate Gaussian distribution with covariance 
E. One of the appeals of this benchmark is to allow for various degrees of heavy tails while allowing to work 
in high dimensional settings. 

For our comparison, we only consider a mild example of banana shape density, with a 1 — 100 and 
b = 0.03. Figure [T] shows how twisted the resulting marginal distribution of (xi,^) is for this choice of 
parameters (it does not depend on the dimension p). (More twisted distributions, i.e. ones with fatter tails 
can be obtained by using higher values of b and/or a 2 .) In our case, the target distribution satisfies E(xi) = 
for all i = 1, . . . ,p, V(xi) = 100, V(x 2 ) = 19, and V(xj) = 1 for alii = 3, . . . ,p. 




Figure 1: Banana shape benchmark: marginal distribution of (xx,X2) for the parameters o\ = 100 and 
b = 0.03. Contours represent 60% (red), 90% (black) and 99.9% (blue) confidence regions in the marginal 
space. 



10 



Target function 




AMIS 


"l\T/~\r~ri A TV , r T O 

NOT AMIS 




5 


0.06558 


0.06879 




1 n 


u.uuooo 


n 1 1 n^i 

U. 1 lUOl 




20 


09167 


0.17912 




5 


0.10215 


11583 


K(xr>) = 


10 


0.21421 


22557 




20 


0.25316 


0.29087 


2^,1=3 ^\ x i) — u 


o 


U.UUt I O 




V^IO TTTi /" \ r, 

Ei=3 M^iJ - U 


1 n 
1U 


u.uuyu/ 




Ej=3 E 0i) = 


20 


0.01666 


0.04208 




5 


2.60672 


3.92650 


v ) — 1UU 


1 n 




7 ZLSS77 
/ .'too / / 




20 


8 20020 


9.71725 




r. 
o 






V(z 2 ) = 19 


10 


3.76660 


5.08474 




20 


4.85407 


5.98031 


E?= 3 V(x i )=3 


5 


0.00645 


0.01196 


Ei 1 £ 3 V(x i )=8 


10 


0.01370 


0.02636 


Ef= 3 V(xi) = 18 


20 


0.04609 


0.06424 



Table 1: Root mean square errors calculated over 10 replications of the AMIS and NOT- AMIS schemes for 
different target functions, given with their expectations for different dimensions p. 



Considering such a target, we compare an iterative importance sampling algorithm that uses the clas- 
sical and not the deterministic mixture version with the Gaussian mixture version of the AMIS algorithm, 
described in Section [3~3| The reference algorithm, called NOT AMIS, uses the past simulations as well for 
creating a new Gaussian mixture proposal, but with the usual importance weights. Given the recent work 
on PMC algorithms (Cappe et al. 20081, it can be considered as state-of-the-art for the comparison. The 



results of this experiment are reported in Table [T] and on Figures [2}|4] 

The details of the experiment are as follows. For both schemes, an initial sample of No — 10 5 particles is 
simulated from a standard logistic distribution and rescaled component- wise to ensure a maximal ESS. This 
optimisation problem is solved via a standard simplex algorithm. In the following, T = 10 iterations and 
Ni = 10, 000 particles are used. The clustering step fitting a mixture to the weighted samples is solved via 
the mixmod software (Biernacki et al. 2006 1, with the number of components in the mixture being calibrated 
via the ICL criterion (Biernacki et al. 2000) during the first iteration. Both schemes take approximatively 
the same computing time (depending of course on the dimension p of the problem) and end up with weighted 
2 x 10 5 particles. 

The results shown in the different figures as well as in Table [T] are all consistent with a domination of 
the AMIS scheme. The gain in ESS is quite spectacular, but justified by the strong stabilisation brought 
by the AMIS averaging. The improvement in root mean square error shown in Table [l] typically varies with 
the target function as well as with the overall dimension p, but may go as far as a threefold reduction. The 
boxplots of the absolute errors convey the same message of a uniform domination of AMIS in this setting. 
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Figure 2: Banana shape example: boxplots of the 10 replicate ESSs for the AMIS scheme (left) and the 
NOT-AMIS scheme (right) for p = 5, 10, 20. 



Figure 3: Banana shape example: boxplots of the 10 replicate absolute errors associated to the estimations 
of E(xi) (left column), E^) (center column) and X)f=3^( x i) (right column) obtained by using the AMIS 
scheme (left in each block) and the NOT- AMIS scheme (right in each block) for p = 5, 10,20. 
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Figure 4: Banana shape example: boxplots of the 10 replicate absolute errors associated to the estimations 
of V(xi) (left column), V(x2) (center column) and X^=3 ^( x i) (right column) obtained by using the AMIS 
scheme (left in each block) and the NOT- AMIS scheme (right in each block) for p = 5, 10,20. 



6 An example in Population Genetics 

To illustrate the potential of the AMIS algorithm, we used it in a population genetics problem which amounts 
to estimate parameters of an evolutionary scenario in which two populations have diverged from a common 
and unknown ancestral population. Data consist of the genotypes at a single microsatellite locus of 50 
diploid individuals sampled from each population. This locus is assumed to evolve according to the strict 
Stepwise Mutation model (SMM), i.e., when a mutation occurs, the number of repeats of the mutated gene 
increases or decreases by one unit with equal probability. After divergence, we also assume that populations 
do not exchange gene (no migration). The four parameters to estimate are the three effective population 
sizes (n\,n2, riAnc) and the time of divergence (tdiv), all scaled by the mutation rate (fi) of the locus : B\ 
(=4ni/i), #2 (=4ri2/x), 9a (=^nAnclJ-) and r (—tdivf)- The likelihood of such a model is costly to obtain, 
which is why we selected this benchmark example. 



Ten data files have been simulated with the software DIYABC (Cornuet et al. 20081, with the following 
values of parameters : m = riAnc = 10,000, n 2 = 2,000, tdiv = 1,000, and // = 0.0005, leading to 
d\ = Q a — 20, 02 — 4, and r = 0.5. Each dataset has been submitted to two types of analyses. The first 
analysis, used as a control, is based on an MCMC in which the gene tree of the sampled genes is updated 



together with the four demographic parameters. This has been performed with the software IM (Hey and 



Nielsen 20041. The second analysis combines the AMIS algorithm and an estimation of the likelihood based 



on importance sampling (IS) for gene genealogies in the same way as Beaumont (20031 embedded an IS 
computation of the likelihood in a MCMC exploration of the parameter space. We note that the likelihood 
of a set of demographic parameters is computed by averaging importance weights of gene trees simulated 
event by event according to proposal distributions and parameter values. Each gene tree is built in three 
steps looking backward in time: i) between present time and time of divergence, lineages are coalesced or 
mutated following Stephens and Donnelly's algorithm (2000) ( Stephens and Donnelly| |2000[ ), monitoring 
times of events as in Beaumont (20031, ii) at time of divergence, remaining lineages of both populations are 



merged and iii) after divergence, the gene tree is completed according to the SDPAC algorithm of Cornuet 



and Beaumont (20071. To assess the repeatability of both methods, each analysis was repeated four times 



(that means with four different random seeds for each dataset). Each MCMC (IM) was run as a single chain 
of 10 7 updates after a burn in period of 10 6 updates. The IS-AMIS was run with two AMIS iterations in 
addition to the initial step, 200,000 particles per iteration, and the likelihood of each particle estimated from 
only two gene genealogies. Uniform priors U[0.l, 100] and W[0.005,5] were chosen for parameters 9 and r. 
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Both methods provided congruent outputs as shown on Figure [5j thus validating the IS- AMIS approach. 
However the major conclusion of this study, whereas each MCMC run lasted about 2 hours, the IS-AMIS 
lasted only around 20 min with a slightly better repeatability in that MCMC outputs are often more variable. 
The calculation of the likelihood function of those population genetics models has a non-negligible cost. We 
use here an importance sampling approximation as in Stephens and Donnelly (2000) and the cost of such 
an approximation considerably increases with the number of simulated gene trees. This type of models is 
then adequate for the adoption and the development of the AMIS algorithm: all the particles simulated 
during the process are recycled, which minimizes the number of calls to the likelihood function. Due to this 
recycling process, the AMIS algorithm cannot be easily compared to other adaptive importance sampling 
schemes which do not naturally involve any recycling procedure. 



7 Conclusion 



We have investigated in this paper an adaptive importance sampling method that extends the scope of the 
original deterministic multiple mixture approach of Owen and Zhou ( 2000 1 in that the sequence of importance 
proposals builds on the samples produced so far. The generality of the AMIS algorithm is that it offers a 
super-efficiency compared with other adaptive importance sampling techniques by allowing for an integral 
recycling of the past simulations. It thus provides a scope for processing those heterogeneous simulations 
as a whole and for treating the computing cost Y^t=o N t as a, single entity. Even though we were unable to 
achieve a complete generality in the convergence properties of the method, we believe this is one of the few 
available algorithms that converge in T rather than in N t at each iteration t. 
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